Closes #169 #405

giyaseddin · 2022-04-10T10:18:23Z

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
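
As a rough illustration of the BUILDER_CONFIGS item in the checklist above, the sketch below checks that a config list contains at least one source config and one bigbio config. `ConfigStub` is a hypothetical stand-in for `BigBioConfig` (only two fields are modelled), and the config name `medquad_bigbio_qa` is an assumed naming pattern, not something confirmed in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConfigStub:
    """Hypothetical stand-in for BigBioConfig; only the fields used below."""

    name: str
    schema: str


# Assumed naming pattern: "<dataset>_source" and "<dataset>_bigbio_<schema>".
BUILDER_CONFIGS: List[ConfigStub] = [
    ConfigStub(name="medquad_source", schema="source"),
    ConfigStub(name="medquad_bigbio_qa", schema="bigbio_qa"),
]


def has_required_schemas(configs: List[ConfigStub]) -> bool:
    """Return True if at least one source and one bigbio config are present."""
    schemas = [config.schema for config in configs]
    has_source = "source" in schemas
    has_bigbio = any(schema.startswith("bigbio_") for schema in schemas)
    return has_source and has_bigbio
```

A loader that ships only the source config would fail this check, which is the situation the checklist item is meant to catch.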

giyaseddin · 2022-04-10T10:21:07Z

biodatasets/medquad/medquad.py

+    f"QATestSetMedQrels_judged_answers": f"{_DATA_PATH}/QA-TestSet-LiveQA-Med-Qrels-2479-Answers.zip",
+}
+
+_SUPPORTED_TASKS = [Tasks.QUESTION_ANSWERING]  # TODO: shall we add a non-existing task type such as `RQE`?


The issue description says the dataset supports QA and RQE. Is it enough to set _SUPPORTED_TASKS this way?

Mmm... I think we could get away with using constants.Tasks.TEXTUAL_ENTAILMENT

Quoting from the readme:

We included additional annotations in the XML files, that could be used for diverse IR and NLP tasks, such as the question type, the question focus, its syonyms, its UMLS Concept Unique Identifier (CUI) and Semantic Type.

So it seems it might be possible to add NAMED_ENTITY_RECOGNITION/DISAMBIGUATION to this, but I'm not sure...

The data provided in the XML files doesn't seem to be structured for NER, e.g. this sample.
I'll take another look to see whether I can parse them.
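
To make the NER concern concrete, here is a minimal sketch of pulling the focus-level annotations out of one document. The XML below is made up for illustration, and its element and attribute names are assumptions rather than a copy of the MedQuAD files. If the real files look roughly like this, the annotations describe the document's focus as a whole and carry no character offsets, so an NER schema would need spans derived some other way (for example by string matching).

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Union

# Made-up document for illustration only; element names are assumptions.
SAMPLE_XML = """
<Document id="0000001">
  <Focus>Example Condition</Focus>
  <FocusAnnotations>
    <UMLS>
      <CUIs><CUI>C0000000</CUI></CUIs>
      <SemanticTypes><SemanticType>T000</SemanticType></SemanticTypes>
    </UMLS>
  </FocusAnnotations>
  <QAPairs>
    <QAPair pid="1">
      <Question qid="0000001-1" qtype="information">What is Example Condition?</Question>
      <Answer>Example answer text.</Answer>
    </QAPair>
  </QAPairs>
</Document>
"""

Annotations = Dict[str, Union[Optional[str], List[Optional[str]]]]


def extract_focus_annotations(xml_text: str) -> Annotations:
    """Collect document-level annotations; none of them carry text offsets."""
    root = ET.fromstring(xml_text.strip())
    return {
        "focus": root.findtext("Focus"),
        "cuis": [element.text for element in root.iter("CUI")],
        "semantic_types": [element.text for element in root.iter("SemanticType")],
        "question_types": [q.get("qtype") for q in root.iter("Question")],
    }
```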

giyaseddin · 2022-04-10T10:22:19Z

biodatasets/medquad/medquad.py

+
+_HOMEPAGE = "https://github.com/abachaa/MedQuAD"
+
+_LICENSE = "https://creativecommons.org/licenses/by/4.0/legalcode"  # TODO: terms aren't available in the repository! In the issue it is 'CC BY 4.0'


Also, the issue description lists CC BY 4.0 as the license type, so I searched for the license terms online. I just wanted to make sure this is correct.

Where did you find the license of the dataset? I cannot seem to find it...

If you check the license field in the description of #169, you will see it, although I couldn't find it mentioned in the repo.
I'm not sure, but I assumed the license in the description is correct, searched for "terms of CC BY 4.0", and pasted the link :)
What do you think it should be, @sg-wbi?

giyaseddin · 2022-04-10T10:25:28Z

This PR closes #169

sg-wbi

@giyaseddin Thank you very much for your contribution! Oh my, this seems to be a nasty one... Could you please check my comments?
I would like to help you out more, but first we should get the dataloader to download the data in a reasonable amount of time/steps and see what we've got. Thanks!

sg-wbi · 2022-04-11T11:34:01Z

biodatasets/medquad/medquad.py

+
+        qa_pairs_enriched_fpath = self._dump_xml_to_json(dl_manager)
+
+        # There is no canonical train/valid/test set in this dataset. So, only TRAIN is added.


I think they use this for testing: https://github.com/abachaa/LiveQA_MedicalTask_TREC2017/tree/master/TestDataset

I added this set, but its general structure doesn't match the rest of the data, so I implemented it to just barely fit the schema.

sg-wbi · 2022-04-14T08:49:12Z

Thank you for checking out my comments @giyaseddin ! I am trying to inspect the dataset with:

from datasets import load_dataset
ds = load_dataset("biodatasets/medquad/medquad.py", "medquad_source")

But I get this error:

    142         raise NotImplementedError("Only `source` and `bigbio_qa` schemas are implemented.")
    144     return datasets.DatasetInfo(
    145         description=_DESCRIPTION,
    146         features=features,
   (...)
    149         citation=_CITATION,
    150     )
--> 152 def _load_qa_from_xml(self, file_paths) -> List[dict[str, str | None]]:
    153     """
    154     This method traverses the whole list of the downloaded XML files and extracts Q&A pairs.
    155     Returns the extracted Q&As and the base directory of the dumped json file that contains them all.
    156     """
    157     assert len(file_paths)

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Could you please make sure we can load both source and bigbio w/o errors? Thank you!
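
For context on the traceback above: the `str | None` union syntax in an annotation is evaluated when the function is defined, and it only works on Python 3.10 and later; older interpreters raise exactly this TypeError. A minimal sketch of a version-independent signature, using `typing.Optional` and `typing.Dict`, is below. The function name and body are hypothetical stand-ins for `_load_qa_from_xml`, not the PR's actual code.

```python
from typing import Dict, List, Optional

QARecord = Dict[str, Optional[str]]


def load_qa_pairs(records: List[QARecord]) -> List[QARecord]:
    """Hypothetical stand-in for `_load_qa_from_xml`.

    Keeps only the records that have a question. The point here is the
    annotation style, which also works on Python versions before 3.10.
    """
    return [record for record in records if record.get("question") is not None]
```

Another option is to add `from __future__ import annotations` at the top of the module, which postpones annotation evaluation so the original `str | None` spelling no longer fails at definition time.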

regel-corpus · 2022-04-20T07:48:33Z

Hey @giyaseddin! Do you plan to keep working on this?

giyaseddin · 2022-04-20T08:00:08Z

Hey @regel-corpus,
I will push my last modifications ASAP.

giyaseddin · 2022-05-17T18:41:11Z

Could you please check whether the current version downloads correctly, @sg-wbi?

rosalinesway · 2022-05-17T19:26:07Z

Hi @giyaseddin, I pulled the latest code, and it seems like this error still occurs upon loading. Could you check again if you have fixed it in your updates?

ruisi-su · 2022-05-28T19:46:13Z

Hi @giyaseddin, thanks for putting in the effort to keep working on this dataset. Would it be possible to pull the up-to-date master into your branch? There are some inconsistencies between your branch and master, which block running the unit tests. Thanks!

Add MedQuAD dataset loader

19726c4

giyaseddin requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber and sg-wbi as code owners April 10, 2022 10:18

sg-wbi self-assigned this Apr 11, 2022

Change download for more efficiency

46e2068

giyaseddin requested a review from debajyotidatta as a code owner April 13, 2022 23:08

giyaseddin added 2 commits April 14, 2022 02:33

Prevent unnecessary data extraction

85ef3fd

Add support for data subsets and test

d3545b7

Update download style & add test datset

37fd094

Fix bigbio_qa fromat

8785975
